Midterm Project: Effects of Air Pollution on Countries
1. Introduction
Motivation
Climate issues are pervasive, but they typically affect low-income communities and developing countries disproportionately. Our group wanted to explore how air pollution has changed over time and how it has affected countries differently. Specifically, we wanted to analyze whether a country’s economic and social position increases, decreases, or has no observable impact on the effects of air pollution. In layman’s terms, does air pollution affect underdeveloped countries disproportionately?
Set Up
Before we start, we need to ensure that we have all the relevant libraries installed and imported.
Run the commands below in the console (or only those for packages your system is missing) to install the packages needed in addition to the ezids package.
install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
install.packages("knitr")
install.packages("magrittr")
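To install only the packages that are missing, a small helper along the following lines can be used. This is a minimal sketch: `find_missing_pkgs` is a hypothetical helper name that is not part of the original report, and the `interactive()` guard is an assumption added so that knitting the document never triggers downloads.

```r
# Packages listed in the install commands above.
required_pkgs <- c("tidyverse", "rworldmap", "tmap", "spData", "sf",
                   "ggpubr", "dplyr", "knitr", "magrittr")

# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
find_missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing_pkgs(required_pkgs)

# Install only in an interactive session, and only what is missing.
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```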
2. Data Sources and Data Wrangling
Data Sources
For our analysis, we will be working with 5 main data sources shown in the table below:
| Data | Source | Link |
|---|---|---|
| Deaths Due to Air Pollution of Countries from 1990 - 2017 | Kaggle | Link |
| GDP Annual Growth of Countries from 1960 - 2020 | Kaggle via WorldBank | Link |
| United Nations Population and Region Data | United Nations | Link |
| United Nations ISO-alpha3 code | United Nations | Link |
| spData for Map Geometries | spData for Mapping | Link |
The main variables in our datasets will include:
| Feature | Data Type | Unit of Measure | Notes and Assumptions |
|---|---|---|---|
| GDP (Gross Domestic Product) | Numerical, Continuous | $USD | This is our chosen proxy for measuring a country’s economic status |
| Population Size | Numerical, Continuous | thousands of people | Annual UN estimate |
| Deaths due to Air Pollution | Numerical, Continuous | deaths per 100,000 people | This is our chosen proxy for measuring the negative effects of air pollution. |
| Country | Qualitative, Categorical | N/A | 231 countries |
| SDG Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Region Classification. |
| Sub Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Sub-Region Classification. |
| ISO-alpha3 Country Code | Qualitative, Categorical | N/A | Standard for identifying countries (text ID). |
| ISO-alpha2 Country Code | Qualitative, Categorical | N/A | Another standard for identifying countries (text ID). |
| M49 Country Code | Numerical, Categorical | N/A | Another standard for identifying countries (numerical ID). |
| Year | Numerical, Categorical | N/A | 1990 to 2017 |
| GDP per Capita | Numerical, Continuous | $USD per person | Normalization of GDP to compare between population sizes (calculated). |
Data Wrangling
While the data from Kaggle were already in a format ready to be cleaned, the data downloaded from the United Nations required a little wrangling. Mainly, we needed to extract just the country-level data from the Excel workbooks into their own self-contained CSV files. Since we only needed to do this once, and selecting the specific cells programmatically would have taken significant time, we opted to perform this step in Excel rather than in R. If this were part of a real production data pipeline, we would take the time to program the extraction, but we would likely choose a language such as Python, whose tooling for web scraping and for data transformation (for example, Pandas) is more robust for these tasks.
- Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
- Figure 4: Sample screenshot of transformed UN dataset.
3. Load, Clean, and Inspect Data
Load Data
| variable | class | first_values |
|---|---|---|
| Country.or.Area | character | Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania |
| ISO.alpha2.code | character | AD, AE, AF, AG, AI, AL |
| ISO.alpha3.code | character | AND, ARE, AFG, ATG, AIA, ALB |
| M49.code | integer | 20, 784, 4, 28, 660, 8 |
| variable | class | first_values |
|---|---|---|
| Entity | character | Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan |
| Code | character | AFG, AFG, AFG, AFG, AFG, AFG |
| Year | integer | 1990, 1991, 1992, 1993, 1994, 1995 |
| Air.pollution..total...deaths.per.100.000. | double | 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243 |
| Indoor.air.pollution..deaths.per.100.000. | double | 250.362909742375, 242.575124973334, 232.043877894811, 231.648133503794, 238.837176822107, 239.906598716878 |
| Outdoor.particulate.matter..deaths.per.100.000. | double | 46.4465894382846, 46.0338405670284, 44.2437660321924, 44.4401481443785, 45.5943284100213, 45.3671411300974 |
| Outdoor.ozone.pollution..deaths.per.100.000. | double | 5.61644203074918, 5.60396011603667, 5.61182206482564, 5.65526606275628, 5.71892222061506, 5.73917378233707 |
| variable | class | first_values |
|---|---|---|
| Country.Name | character | Aruba, Afghanistan, Angola, Albania, Andorra, Arab World |
| Country.Code | character | ABW, AFG, AGO, ALB, AND, ARB |
| Indicator.Name | character | GDP (current US$), GDP (current US$), GDP (current US$), GDP (current US$), GDP (current US$), GDP (current US$) |
| Indicator.Code | character | NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD |
| X1960 | double | NA, 537777811.111111, NA, NA, NA, NA |
| X1961 | double | NA, 548888895.555556, NA, NA, NA, NA |
| X1962 | double | NA, 546666677.777778, NA, NA, NA, NA |
| X1963 | double | NA, 751111191.111111, NA, NA, NA, NA |
| X1964 | double | NA, 800000044.444444, NA, NA, NA, NA |
| X1965 | double | NA, 1006666637.77778, NA, NA, NA, NA |
| variable | class | first_values |
|---|---|---|
| SDGRegion | character | SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA |
| SubRegion | character | Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa |
| Country | character | Burundi, Comoros, Djibouti, Eritrea, Ethiopia, Kenya |
| Notes | integer | NA, NA, NA, NA, NA, NA |
| Country.code | integer | 108, 174, 262, 232, 231, 404 |
| Type | character | Country/Area, Country/Area, Country/Area, Country/Area, Country/Area, Country/Area |
| Parent.code | integer | 910, 910, 910, 910, 910, 910 |
| X1950 | character | 2 309, 159, 62, 822, 18 128, 6 077 |
| X1951 | character | 2 360, 163, 63, 835, 18 467, 6 242 |
| X1952 | character | 2 406, 167, 65, 849, 18 820, 6 416 |
| variable | class | first_values |
|---|---|---|
| iso_a2 | character | FJ, TZ, EH, CA, US, KZ |
| name_long | character | Fiji, Tanzania, Western Sahara, Canada, United States, Kazakhstan |
| continent | character | Oceania, Africa, Africa, North America, North America, Asia |
| region_un | character | Oceania, Africa, Africa, Americas, Americas, Asia |
| subregion | character | Melanesia, Eastern Africa, Northern Africa, Northern America, Northern America, Central Asia |
| type | character | Sovereign country, Sovereign country, Indeterminate, Sovereign country, Country, Sovereign country |
| area_km2 | double | 19289.9707329765, 932745.792357074, 96270.6010408472, 10036042.9767873, 9510743.74482458, 2729810.51298781 |
| pop | double | 885806, 52234869, NA, 35535348, 318622525, 17288285 |
| lifeExp | double | 69.96, 64.163, NA, 81.9530487804878, 78.8414634146341, 71.62 |
| gdpPercap | double | 8222.25378436842, 2402.09940362843, NA, 43079.1425247165, 51921.9846391384, 23587.3375151466 |
Clean Data
First, we need to drop unnecessary columns and set the data types (factor, num, etc.).
Clean air_pollution_df:
| variable | class | first_values |
|---|---|---|
| Country | integer | Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan |
| ISO.alpha3.code | integer | AFG, AFG, AFG, AFG, AFG, AFG |
| Year | integer | 1990, 1991, 1992, 1993, 1994, 1995 |
| Deaths.Air.Pollution.per.100k | double | 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243 |
Clean gdp_df:
| variable | class | first_values |
|---|---|---|
| Country | integer | Aruba, Aruba, Aruba, Aruba, Aruba, Aruba |
| ISO.alpha3.code | integer | ABW, ABW, ABW, ABW, ABW, ABW |
| Year | integer | 1986, 1987, 1988, 1989, 1990, 1991 |
| GDP.USD | double | 405463417.11746, 487602457.746416, 596423607.114715, 695304363.031101, 764887117.194486, 872138715.083799 |
Clean population_region_df:
| variable | class | first_values |
|---|---|---|
| SDGRegion | integer | SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA |
| SubRegion | integer | EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica |
| Country | integer | Burundi, Burundi, Burundi, Burundi, Burundi, Burundi |
| M49.code | integer | 108, 108, 108, 108, 108, 108 |
| Year | integer | 1950, 1951, 1952, 1953, 1954, 1955 |
| Population.thousands | double | 2309, 2360, 2406, 2449, 2492, 2537 |
Clean the ISO country code dataset:
| variable | class | first_values |
|---|---|---|
| Country.or.Area | integer | Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania |
| ISO.alpha2.code | integer | AD, AE, AF, AG, AI, AL |
| ISO.alpha3.code | integer | AND, ARE, AFG, ATG, AIA, ALB |
| M49.code | integer | 20, 784, 4, 28, 660, 8 |
Clean world:
| variable | class | first_values |
|---|---|---|
| iso_a2 | integer | FJ, TZ, EH, CA, US, KZ |
Note that we only have geometries for 175 countries, so some countries cannot be plotted on a map. That is acceptable for our purposes.
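The report does not show its cleaning code, so the following is only a base R sketch of two steps the cleaned tables imply: pivoting the wide `X1960`, `X1961`, ... columns into a `Year` column, and parsing UN population strings such as "2 309". The toy frame reuses a few values from the loaded GDP table; `parse_un_number` is a hypothetical helper name.

```r
# Toy wide-format frame mimicking the GDP file layout (values from the Load Data table).
gdp_wide <- data.frame(
  Country.Name = c("Aruba", "Afghanistan"),
  Country.Code = c("ABW", "AFG"),
  X1960 = c(NA, 537777811.111111),
  X1961 = c(NA, 548888895.555556)
)

# Columns named X<year> hold one year of data each.
year_cols <- grep("^X[0-9]{4}$", names(gdp_wide), value = TRUE)

# Stack one small data frame per year column to get a long table.
gdp_long <- do.call(rbind, lapply(year_cols, function(col) {
  data.frame(
    Country = gdp_wide$Country.Name,
    ISO.alpha3.code = gdp_wide$Country.Code,
    Year = as.integer(sub("^X", "", col)),
    GDP.USD = gdp_wide[[col]]
  )
}))

# Drop country-years with no GDP value.
gdp_long <- gdp_long[!is.na(gdp_long$GDP.USD), ]

# Hypothetical helper: UN population counts use a space as thousands separator.
# Note: files that use non-breaking spaces may need an extra pattern.
parse_un_number <- function(x) as.numeric(gsub("[[:space:]]", "", x))

parse_un_number(c("2 309", "159", "18 128"))
```

Packages such as tidyr offer more concise reshaping, but base R keeps the sketch dependency-free.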
Final DataFrame Construction
Now let’s merge our datasets into one using a series of inner joins, with country code and year as keys depending on the specific join. We use inner joins because we want to drop rows that would otherwise contain null values: countries without a matching country code, and years that fall outside the range of our shortest dataset (the air pollution dataset, 1990 to 2017).
| variable | class | first_values |
|---|---|---|
| ISO.alpha2.code | integer | AD, AD, AD, AD, AD, AD |
| M49.code | integer | 20, 20, 20, 20, 20, 20 |
| Year | integer | 2012, 2013, 1990, 1991, 1992, 1993 |
| ISO.alpha3.code | integer | AND, AND, AND, AND, AND, AND |
| Country.x | integer | Andorra, Andorra, Andorra, Andorra, Andorra, Andorra |
| Deaths.Air.Pollution.per.100k | double | 17.6754871826169, 17.1893417774086, 29.0238806202567, 28.6956788863825, 28.4603211317312, 27.8408717612189 |
| GDP.USD | double | 3188808942.56713, 3193704343.20627, 1029048481.88051, 1106928582.86629, 1210013651.87713, 1007025755.00065 |
| SDGRegion | integer | EUROPE, EUROPE, EUROPE, EUROPE, EUROPE, EUROPE |
| SubRegion | integer | SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope |
| Population.thousands | double | 82, 81, 55, 57, 59, 61 |
| geom | list | list(), list(), list(), list(), list(), list() |
| gdp.per.capita | double | 38887913.9337455, 39428448.6815589, 18709972.3978275, 19419799.6994086, 20508705.9640192, 16508618.9344369 |
Our dataset is finally ready to be analyzed.
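The joins and the per-capita calculation can be sketched in base R as below. The toy frames reuse the Andorra values from the tables above plus one made-up row (code `XXX`) to show what an inner join drops; the frame names are assumptions. One thing worth checking: the `gdp.per.capita` values in the final table (about 3.9e7 for Andorra) equal `GDP.USD / Population.thousands`, which is USD per thousand people. The sketch multiplies the population by 1,000 to obtain USD per person.

```r
# Toy inputs; the "XXX" row is hypothetical and has no match in the other tables.
air <- data.frame(
  ISO.alpha3.code = c("AND", "AND", "XXX"),
  Year = c(2012L, 2013L, 2012L),
  Deaths.Air.Pollution.per.100k = c(17.6754871826169, 17.1893417774086, 100)
)
gdp <- data.frame(
  ISO.alpha3.code = c("AND", "AND"),
  Year = c(2012L, 2013L),
  GDP.USD = c(3188808942.56713, 3193704343.20627)
)
iso <- data.frame(ISO.alpha3.code = "AND", M49.code = 20L)
pop <- data.frame(
  M49.code = c(20L, 20L),
  Year = c(2012L, 2013L),
  Population.thousands = c(82, 81)
)

# merge() performs an inner join by default (all = FALSE).
joined <- merge(air, gdp, by = c("ISO.alpha3.code", "Year"))
joined <- merge(joined, iso, by = "ISO.alpha3.code")
joined <- merge(joined, pop, by = c("M49.code", "Year"))

# Population is stored in thousands, so scale it before dividing.
joined$gdp.per.capita <- joined$GDP.USD / (joined$Population.thousands * 1000)

joined[, c("ISO.alpha3.code", "Year", "gdp.per.capita")]
```

Because the later models use `log(gdp.per.capita)`, rescaling by a constant factor would change only the fitted intercept, not the slope or the R2.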
4. EDA - Exploratory Data Analysis
Quick Plots
Let’s start our EDA process by just looking at some quick plots to look at the distribution of data.
Histogram of Air Pollution Induced Deaths, Population, and GDP per Capita
- Figure 16: Histogram of Air Pollution Induced Deaths per 100k.
- Figure 17: Histogram of Population.
- Figure 18: Histogram of GDP per Capita.
It looks like Deaths.Air.Pollution.per.100k, Population.thousands, and gdp.per.capita are not normally distributed; all three are right-skewed.
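The visual impression of right skew can be backed by a number. The sketch below uses synthetic log-normal data as a stand-in for the real columns, and `sample_skewness` is a hypothetical helper, so treat it as an illustration rather than the report's own check.

```r
set.seed(42)

# Synthetic right-skewed data standing in for a column such as gdp.per.capita.
x <- rlnorm(500, meanlog = 3, sdlog = 1)

# Hypothetical helper: moment-based sample skewness (positive means right-skewed).
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / sd(v)^3
}

skew_raw <- sample_skewness(x)       # skewness on the original scale
skew_log <- sample_skewness(log(x))  # skewness after a log transform

# Shapiro-Wilk normality test; shapiro.test() accepts between 3 and 5000 values,
# so the full 5197-row dataset would need to be subsampled first.
sw <- shapiro.test(x)

c(skew_raw = skew_raw, skew_log = skew_log, shapiro_p = sw$p.value)
```

A skewness near zero after the log transform is one hint that a log scale may suit the later modeling.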
Boxplot of Air Pollution Induced Deaths, Population, and GDP per Capita
Let’s look at a boxplot for the outliers.
- Figure 19: Boxplot of Deaths per 100,000 from Air Pollution vs SDG Region
It is interesting to note that Australia/New Zealand, Europe, and North America have the lowest deaths per 100k from air pollution and are all tightly clustered (low variance) relative to other regions around the world. Furthermore, these regions contain the most economically developed countries.
Let’s take another look but at SubRegions.
- Figure 20: Boxplot of Deaths per 100k from Air Pollution vs Sub Region
Separating the data into an even more granular grouping of regions shows a trend: Australia/New Zealand, North America, Northern Europe, and Western Europe all have low deaths per 100k and low variance. Historically, these regions consist of countries that were already considered ‘First World’ before 1990, our first year of analysis. We will dig into this more later in our SMART questions.
What does the GDP per capita of these regions look like comparatively? Let’s take a look.
- Figure 21: Boxplot of GDP per Capita vs Sub Region
Interesting to observe that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively. We will try to see if we can quantify this relationship later on in our main research analysis.
Map of Countries
Due to the nature of our data set, plotting maps and maps with intensities can add another dimension to how we visualize and therefore understand our data.
- Figure 22: Global Map of SDGRegions and SubRegions
- Figure 23: Global Intensity Map of Key Numerical Features, 1990 to 2017
The maps suggest an inverse relationship between gdp.per.capita and Deaths.Air.Pollution.per.100k.
We can also use ggplot2 to have a bit more control over map plotting.
- Figure 24: Global Intensity Map of Deaths due to Air Pollution per 100k People, 1990 to 2017
- Figure 25: Intensity Map of Deaths due to Air Pollution per 100k People in East and Southeastern Asia, 2017
SMART Questions
1. Is there a relationship between population size and Deaths per 100,000 due to air pollution?
Below, we measure the relationship between population size (in thousands) and deaths per 100,000 due to air pollution. Since both variables are numerical, we use a correlation coefficient, keeping in mind that neither variable is normally distributed. From the results below, we see essentially no correlation between a country’s population size and its deaths due to air pollution (0.069). We do, however, observe a moderate negative correlation between deaths due to air pollution and GDP per capita (-0.543).
## 'data.frame': 5197 obs. of 12 variables:
## $ ISO.alpha2.code : Factor w/ 248 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ M49.code : Factor w/ 249 levels "4","8","10","12",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Year : Factor w/ 28 levels "1990","1991",..: 23 24 1 2 3 4 5 6 7 8 ...
## $ ISO.alpha3.code : Factor w/ 197 levels "","AFG","AGO",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Country.x : Factor w/ 231 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Deaths.Air.Pollution.per.100k: num 17.7 17.2 29 28.7 28.5 ...
## $ GDP.USD : num 3188808943 3193704343 1029048482 1106928583 1210013652 ...
## $ SDGRegion : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SubRegion : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ Population.thousands : num 82 81 55 57 59 61 63 64 64 64 ...
## $ geom :sfc_MULTIPOLYGON of length 5197; first list element: list()
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## $ gdp.per.capita : num 38887914 39428449 18709972 19419800 20508706 ...
## [1] 0.037
## Deaths.Air.Pollution.per.100k
## Deaths.Air.Pollution.per.100k 1.000
## Population.thousands 0.069
## gdp.per.capita -0.543
## Population.thousands gdp.per.capita
## Deaths.Air.Pollution.per.100k 0.069 -0.543
## Population.thousands 1.000 -0.040
## gdp.per.capita -0.040 1.000
| | Deaths.Air.Pollution.per.100k | Population.thousands | gdp.per.capita |
|---|---|---|---|
| Deaths.Air.Pollution.per.100k | 1.000 | 0.069 | -0.543 |
| Population.thousands | 0.069 | 1.000 | -0.040 |
| gdp.per.capita | -0.543 | -0.040 | 1.000 |
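Because the variables are skewed, a rank-based (Spearman) correlation is a reasonable companion to the matrix above. The report does not state which method it used, so this is a sketch on synthetic data; the column names match the report, but the values and the assumed power-law relationship are made up for illustration.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the three numeric columns.
toy <- data.frame(
  Population.thousands = rlnorm(n, meanlog = 9, sdlog = 1.5),
  gdp.per.capita = exp(runif(n, 6, 11))
)
# Assumed relationship for the toy data: deaths fall as GDP per capita rises.
toy$Deaths.Air.Pollution.per.100k <-
  exp(7 - 0.4 * log(toy$gdp.per.capita) + rnorm(n, sd = 0.3))

# Correlation matrices: Pearson (linear) and Spearman (rank-based).
pearson <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")
round(spearman, 3)

# Significance test for one pair of variables.
sp_test <- cor.test(toy$Deaths.Air.Pollution.per.100k, toy$gdp.per.capita,
                    method = "spearman", exact = FALSE)
sp_test$p.value
```

Spearman's coefficient depends only on ranks, so it is unaffected by a log transform of either variable.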
3. Which regions have the lowest and highest deaths due to air pollution?
## grouped_df [252 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ SDGRegion: Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ total : num [1:252] 50.5 49.1 48.9 47.3 45.9 ...
## - attr(*, "groups")= tibble [9 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ SDGRegion: Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 1 2 3 4 5 6 7 8 9
## ..$ .rows : list<int> [1:9]
## .. ..$ : int [1:28] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ : int [1:28] 29 30 31 32 33 34 35 36 37 38 ...
## .. ..$ : int [1:28] 57 58 59 60 61 62 63 64 65 66 ...
## .. ..$ : int [1:28] 85 86 87 88 89 90 91 92 93 94 ...
## .. ..$ : int [1:28] 113 114 115 116 117 118 119 120 121 122 ...
## .. ..$ : int [1:28] 141 142 143 144 145 146 147 148 149 150 ...
## .. ..$ : int [1:28] 169 170 171 172 173 174 175 176 177 178 ...
## .. ..$ : int [1:28] 197 198 199 200 201 202 203 204 205 206 ...
## .. ..$ : int [1:28] 225 226 227 228 229 230 231 232 233 234 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
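The structure above is a per-region, per-year summary. One way to rank regions from such data is sketched below in base R. The toy values are invented purely for illustration, and averaging over years is an assumption, since the report does not say how its `total` column was computed.

```r
# Invented toy values; only the column names follow the report.
toy <- data.frame(
  SDGRegion = c("EUROPE", "EUROPE",
                "SUB-SAHARANAFRICA", "SUB-SAHARANAFRICA",
                "AUSTRALIA/NEWZEALAND", "AUSTRALIA/NEWZEALAND"),
  Year = c(1990, 2017, 1990, 2017, 1990, 2017),
  Deaths.Air.Pollution.per.100k = c(60, 40, 180, 120, 25, 15)
)

# Mean deaths per 100k for each region across all years.
region_mean <- aggregate(Deaths.Air.Pollution.per.100k ~ SDGRegion,
                         data = toy, FUN = mean)

# Sort ascending so the first row is the lowest and the last row the highest.
region_mean <- region_mean[order(region_mean$Deaths.Air.Pollution.per.100k), ]

lowest_region <- as.character(region_mean$SDGRegion[1])
highest_region <- as.character(region_mean$SDGRegion[nrow(region_mean)])
```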
4. How do deaths due to air pollution change over time? More specifically, are death rates in the most recent X years higher than death rates in earlier groups of X years?
5. Main Research Question
Do lower GDP countries have more deaths per 100k due to air pollution?
Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?
Linear Fit
Let’s first look at the general fit on the overall data.
- Figure XX: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.
From the plot, we observe that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the linear relationship is weak: the R2 is only 0.295, meaning that a straight-line fit on GDP per capita explains only about 30% of the variance in deaths due to air pollution per 100k.
Looking at each SDGRegion individually, the linear fits improve overall but are still not particularly strong. The highest are Australia/New Zealand and Europe, with R2 values of 0.56 and 0.55 respectively.
- Figure XX: Linear models for each SDGRegion, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.
Let’s now look at how slicing by year plays a part.
- Figure XX: Linear models for each Year, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.
As observed, time does not seem to play a significant part in describing the relationship between deaths due to air pollution per 100k vs GDP per capita as the R2 stays roughly constant around 0.3 across all the years.
Transformed Log Scale - Linear Fit
Perhaps we should look at a non-linear fit. From our visuals, we see that every plot starts off at very high deaths due to air pollution per 100k and then drops off dramatically as GDP per capita increases. However, the drop-off begins to taper off and asymptotically approaches some value. (It will be interesting to see if we can generalize what that GDP per capita value is. Let’s table that for later.) We have seen this type of behavior before in log graphs such as the one shown below.
- Figure XX: Sample log graph.
Our data seems to follow -log(x) rather than log(x). Let’s wrap both features in the log() function, fit a linear model to the transformed values, and see what the relationship is.
fit2’s summary statistics are:
##
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita),
## data = final_df_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.099 -0.235 0.000 0.206 1.431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.07849 0.04871 207 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952 0.00323 -121 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.737
## F-statistic: 1.45e+04 on 1 and 5195 DF, p-value: <0.0000000000000002
Let’s replot with this new fit.
- Figure XX Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship by SDGRegion.
- Figure XX Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship by years.
- Figure XX Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship.
Across the board, the strength of our linear relationship increases dramatically when both features are first transformed with the log() function. The new R2 is 0.737, which means around 74% of the variance in the log of our target feature can be explained by this relationship.
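The effect of the transformation can be reproduced on synthetic data. The sketch below generates toy data from an assumed power law (the slope of -0.39 is chosen to resemble fit2, and everything else is made up) and compares the R2 of a straight-line fit with that of a log-log fit.

```r
set.seed(7)
n <- 300

# Synthetic data from an assumed power law with multiplicative noise.
gdp <- exp(runif(n, 10, 18))
deaths <- exp(10 - 0.39 * log(gdp) + rnorm(n, sd = 0.3))
toy <- data.frame(gdp.per.capita = gdp, Deaths.Air.Pollution.per.100k = deaths)

# Straight-line fit on the original scale.
fit_linear <- lm(Deaths.Air.Pollution.per.100k ~ gdp.per.capita, data = toy)

# Linear fit after log-transforming both variables (log() is the natural log in R).
fit_loglog <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita),
                 data = toy)

r2_linear <- summary(fit_linear)$r.squared
r2_loglog <- summary(fit_loglog)$r.squared
slope_loglog <- coef(fit_loglog)[["log(gdp.per.capita)"]]

c(r2_linear = r2_linear, r2_loglog = r2_loglog, slope = slope_loglog)
```

In a log-log model the slope reads as an elasticity: a 1% increase in GDP per capita is associated with roughly a `slope`% change in deaths per 100k.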
Let’s test a few more regression models by adding more features and see what happens.
fit3 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion, data=final_df_sf)
fit4 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion+Year, data=final_df_sf)
fit5 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*Year, data=final_df_sf)
The R2 values for fit3, fit4, and fit5 are 0.883, 0.887, and 0.747 respectively.
Let’s check out the VIFs to see if we should keep any of our new models.
| log(gdp.per.capita) | log(gdp.per.capita):SubRegionCaribbean | log(gdp.per.capita):SubRegionCentralAmerica | log(gdp.per.capita):SubRegionCentralAsia | log(gdp.per.capita):SubRegionEasternAfrica | log(gdp.per.capita):SubRegionEasternAsia | log(gdp.per.capita):SubRegionEasternEurope | log(gdp.per.capita):SubRegionMelanesia | log(gdp.per.capita):SubRegionMicronesia | log(gdp.per.capita):SubRegionMiddleAfrica | log(gdp.per.capita):SubRegionNorthernAfrica | log(gdp.per.capita):SubRegionNORTHERNAMERICA | log(gdp.per.capita):SubRegionNorthernEurope | log(gdp.per.capita):SubRegionPolynesia | log(gdp.per.capita):SubRegionSouth-EasternAsia | log(gdp.per.capita):SubRegionSouthAmerica | log(gdp.per.capita):SubRegionSouthernAfrica | log(gdp.per.capita):SubRegionSouthernAsia | log(gdp.per.capita):SubRegionSouthernEurope | log(gdp.per.capita):SubRegionWesternAfrica | log(gdp.per.capita):SubRegionWesternAsia | log(gdp.per.capita):SubRegionWesternEurope | SubRegionCaribbean | SubRegionCentralAmerica | SubRegionCentralAsia | SubRegionEasternAfrica | SubRegionEasternAsia | SubRegionEasternEurope | SubRegionMelanesia | SubRegionMicronesia | SubRegionMiddleAfrica | SubRegionNorthernAfrica | SubRegionNORTHERNAMERICA | SubRegionNorthernEurope | SubRegionPolynesia | SubRegionSouth-EasternAsia | SubRegionSouthAmerica | SubRegionSouthernAfrica | SubRegionSouthernAsia | SubRegionSouthernEurope | SubRegionWesternAfrica | SubRegionWesternAsia | SubRegionWesternEurope |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 975 | 7945 | 5044 | 3147 | 9101 | 2492 | 5911 | 2988 | 2640 | 5154 | 3892 | 3611 | 5909 | 1944 | 5987 | 7157 | 3933 | 5166 | 7168 | 9144 | 9611 | 5771 | 6747 | 3902 | 2131 | 5646 | 2101 | 4744 | 2281 | 2106 | 3457 | 2936 | 3714 | 5880 | 1584 | 4427 | 5680 | 3037 | 3437 | 6334 | 5707 | 8049 | 5965 |
| log(gdp.per.capita) | log(gdp.per.capita):SubRegionCaribbean | log(gdp.per.capita):SubRegionCentralAmerica | log(gdp.per.capita):SubRegionCentralAsia | log(gdp.per.capita):SubRegionEasternAfrica | log(gdp.per.capita):SubRegionEasternAsia | log(gdp.per.capita):SubRegionEasternEurope | log(gdp.per.capita):SubRegionMelanesia | log(gdp.per.capita):SubRegionMicronesia | log(gdp.per.capita):SubRegionMiddleAfrica | log(gdp.per.capita):SubRegionNorthernAfrica | log(gdp.per.capita):SubRegionNORTHERNAMERICA | log(gdp.per.capita):SubRegionNorthernEurope | log(gdp.per.capita):SubRegionPolynesia | log(gdp.per.capita):SubRegionSouth-EasternAsia | log(gdp.per.capita):SubRegionSouthAmerica | log(gdp.per.capita):SubRegionSouthernAfrica | log(gdp.per.capita):SubRegionSouthernAsia | log(gdp.per.capita):SubRegionSouthernEurope | log(gdp.per.capita):SubRegionWesternAfrica | log(gdp.per.capita):SubRegionWesternAsia | log(gdp.per.capita):SubRegionWesternEurope | SubRegionCaribbean | SubRegionCentralAmerica | SubRegionCentralAsia | SubRegionEasternAfrica | SubRegionEasternAsia | SubRegionEasternEurope | SubRegionMelanesia | SubRegionMicronesia | SubRegionMiddleAfrica | SubRegionNorthernAfrica | SubRegionNORTHERNAMERICA | SubRegionNorthernEurope | SubRegionPolynesia | SubRegionSouth-EasternAsia | SubRegionSouthAmerica | SubRegionSouthernAfrica | SubRegionSouthernAsia | SubRegionSouthernEurope | SubRegionWesternAfrica | SubRegionWesternAsia | SubRegionWesternEurope | Year1991 | Year1992 | Year1993 | Year1994 | Year1995 | Year1996 | Year1997 | Year1998 | Year1999 | Year2000 | Year2001 | Year2002 | Year2003 | Year2004 | Year2005 | Year2006 | Year2007 | Year2008 | Year2009 | Year2010 | Year2011 | Year2012 | Year2013 | Year2014 | Year2015 | Year2016 | Year2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 987 | 8010 | 5074 | 3170 | 9188 | 2516 | 5949 | 2997 | 2662 | 5201 | 3917 | 3613 | 5952 | 1951 | 6051 | 7192 | 3955 | 5204 | 7232 | 9195 | 9704 | 5774 | 6800 | 3922 | 2144 | 5695 | 2121 | 4771 | 2285 | 2122 | 3486 | 2952 | 3717 | 5922 | 1587 | 4473 | 5704 | 3051 | 3458 | 6390 | 5730 | 8125 | 5967 | 1.93 | 1.94 | 1.95 | 1.96 | 1.99 | 1.99 | 1.99 | 1.99 | 1.99 | 2.02 | 2.02 | 2.05 | 2.05 | 2.06 | 2.07 | 2.08 | 2.09 | 2.11 | 2.1 | 2.11 | 2.12 | 2.12 | 2.13 | 2.13 | 2.11 | 2.1 | 2.11 |
| log(gdp.per.capita) | log(gdp.per.capita):Year1991 | log(gdp.per.capita):Year1992 | log(gdp.per.capita):Year1993 | log(gdp.per.capita):Year1994 | log(gdp.per.capita):Year1995 | log(gdp.per.capita):Year1996 | log(gdp.per.capita):Year1997 | log(gdp.per.capita):Year1998 | log(gdp.per.capita):Year1999 | log(gdp.per.capita):Year2000 | log(gdp.per.capita):Year2001 | log(gdp.per.capita):Year2002 | log(gdp.per.capita):Year2003 | log(gdp.per.capita):Year2004 | log(gdp.per.capita):Year2005 | log(gdp.per.capita):Year2006 | log(gdp.per.capita):Year2007 | log(gdp.per.capita):Year2008 | log(gdp.per.capita):Year2009 | log(gdp.per.capita):Year2010 | log(gdp.per.capita):Year2011 | log(gdp.per.capita):Year2012 | log(gdp.per.capita):Year2013 | log(gdp.per.capita):Year2014 | log(gdp.per.capita):Year2015 | log(gdp.per.capita):Year2016 | log(gdp.per.capita):Year2017 | Year1991 | Year1992 | Year1993 | Year1994 | Year1995 | Year1996 | Year1997 | Year1998 | Year1999 | Year2000 | Year2001 | Year2002 | Year2003 | Year2004 | Year2005 | Year2006 | Year2007 | Year2008 | Year2009 | Year2010 | Year2011 | Year2012 | Year2013 | Year2014 | Year2015 | Year2016 | Year2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34.7 | 187 | 179 | 181 | 177 | 183 | 185 | 186 | 185 | 183 | 185 | 186 | 187 | 187 | 189 | 191 | 194 | 197 | 202 | 209 | 213 | 216 | 219 | 221 | 223 | 224 | 222 | 222 | 187 | 180 | 181 | 177 | 184 | 187 | 189 | 187 | 185 | 187 | 188 | 190 | 192 | 196 | 200 | 205 | 210 | 217 | 223 | 228 | 232 | 236 | 239 | 240 | 240 | 238 | 238 |
Although adding more features to our regression model yields higher R2 values, the Variance Inflation Factors (VIFs) are extremely high in all three models (in the hundreds to thousands for most terms), which indicates that the added terms are highly correlated with one another. We therefore reject fit3, fit4, and fit5 and stick with our second model, fit2.
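For intuition about what a VIF measures, it can be computed by hand as 1 / (1 - R2), where R2 comes from regressing one predictor on all the others. The report does not show how its VIFs were obtained, so the following is a base R sketch on made-up predictors; `vif_manual` is a hypothetical helper that handles numeric columns only.

```r
set.seed(3)
n <- 200

# Made-up predictors: x2 is almost a copy of x1, x3 is independent.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF of each column = 1 / (1 - R^2) from regressing
# that column on all remaining columns.
vif_manual <- function(df) {
  sapply(names(df), function(target) {
    others <- setdiff(names(df), target)
    fit <- lm(reformulate(others, response = target), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(predictors)
round(vifs, 2)
```

Interaction terms built from an uncentered predictor are strongly correlated with their own main effects by construction, so very large VIFs in models such as fit3 to fit5 are not surprising.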
We can then predict a country’s deaths per 100k caused by air pollution in a given year from the country’s GDP per capita with the following equation. Here \(\ln\) is the natural logarithm (which is what R’s log() computes), and GDP per capita is in the units of the gdp.per.capita column used to fit the model:
\[ \ln(Deaths_{from~air~pollution~per~100k|per~year|per~country}) = 10.07849 - 0.38952 \cdot \ln(GDP_{per~capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]
or solving for our target variable:
\[ Deaths_{from~air~pollution~per~100k|per~year|per~country} = e^{10.07849 - 0.38952 \cdot \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]
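As a worked example, the fitted coefficients from the fit2 summary can be applied directly. R's log() is the natural logarithm, so inverting the model uses exp() rather than a power of 10, and the result is already a rate per 100k. The function name below is hypothetical; the input is Andorra's 2012 `gdp.per.capita` value as it appears in the final table.

```r
# Coefficients reported in the fit2 summary.
intercept <- 10.07849
slope <- -0.38952

# Hypothetical helper: predicted deaths per 100k for a given gdp.per.capita value
# (expressed in the same units as the column used to fit the model).
predict_deaths_per_100k <- function(gdp_per_capita) {
  exp(intercept + slope * log(gdp_per_capita))
}

# Andorra, 2012, using the gdp.per.capita value from the final table.
andorra_2012 <- predict_deaths_per_100k(38887913.9337455)
andorra_2012

# Equivalent power-law form: exp(intercept) * gdp^slope.
power_law <- exp(intercept) * 38887913.9337455^slope
```

The prediction of roughly 26 deaths per 100k compares with an observed value of about 17.7 for that row, which gives a feel for the residual spread around the fit.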
Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?
We all know that correlation does not necessarily mean causation. Let us dig a little deeper and test if means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.
One-Way ANOVA Test
We start off by performing a One-Way ANOVA test to determine if the means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.
H0: \(\mu_{deaths\_lowest\_gdp}\) = \(\mu_{deaths\_low\_gdp}\) = \(\mu_{deaths\_medium\_gdp}\) = \(\mu_{deaths\_high\_gdp}\)
H1: At least one of \(\mu_{deaths\_lowest\_gdp}\), \(\mu_{deaths\_low\_gdp}\), \(\mu_{deaths\_medium\_gdp}\), \(\mu_{deaths\_high\_gdp}\) is not equal to the others
We will use an \(\alpha\) value of 0.05.
The ANOVA p-value is reported as 0e+00 (effectively zero), which is lower than \(\alpha = 0.05\). Therefore, we reject our null hypothesis that \(\mu_{deaths\_lowest\_gdp}\) = \(\mu_{deaths\_low\_gdp}\) = \(\mu_{deaths\_medium\_gdp}\) = \(\mu_{deaths\_high\_gdp}\). This means there is statistically significant evidence that at least one of the mean death rates across the lowest, low, medium, and high GDP per capita groups differs from the others.
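The report does not show how the four GDP per capita groups were formed. The sketch below assumes quartile-based groups and uses synthetic data, purely to illustrate the mechanics of binning with cut() and running a one-way ANOVA with aov().

```r
set.seed(11)
n <- 400

# Synthetic data from an assumed power law, standing in for the real columns.
gdp <- exp(runif(n, 10, 18))
deaths <- exp(10 - 0.39 * log(gdp) + rnorm(n, sd = 0.3))

# Assumption: four groups defined by the quartiles of GDP per capita.
breaks <- quantile(gdp, probs = seq(0, 1, by = 0.25))
gdp_level <- cut(gdp, breaks = breaks, include.lowest = TRUE,
                 labels = c("lowest", "low", "medium", "high"))

# One-way ANOVA: do mean deaths per 100k differ across the groups?
anova_fit <- aov(deaths ~ gdp_level)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

table(gdp_level)
anova_p
```

ANOVA assumes roughly normal residuals with similar variance in each group, so running it on log(deaths) may be worth considering given the skew seen earlier.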
2-Sample T-Tests
We will conduct six two-sample t-tests to determine whether each pair of groups differs:
- Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
  - H0: \(\mu_{deaths\_lowest\_gdp}\) = \(\mu_{deaths\_low\_gdp}\)
  - H1: \(\mu_{deaths\_lowest\_gdp}\) != \(\mu_{deaths\_low\_gdp}\)
- Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
  - H0: \(\mu_{deaths\_low\_gdp}\) = \(\mu_{deaths\_medium\_gdp}\)
  - H1: \(\mu_{deaths\_low\_gdp}\) != \(\mu_{deaths\_medium\_gdp}\)
- Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
  - H0: \(\mu_{deaths\_medium\_gdp}\) = \(\mu_{deaths\_high\_gdp}\)
  - H1: \(\mu_{deaths\_medium\_gdp}\) != \(\mu_{deaths\_high\_gdp}\)
- Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
  - H0: \(\mu_{deaths\_lowest\_gdp}\) = \(\mu_{deaths\_high\_gdp}\)
  - H1: \(\mu_{deaths\_lowest\_gdp}\) != \(\mu_{deaths\_high\_gdp}\)
- Lowest GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
  - H0: \(\mu_{deaths\_lowest\_gdp}\) = \(\mu_{deaths\_medium\_gdp}\)
  - H1: \(\mu_{deaths\_lowest\_gdp}\) != \(\mu_{deaths\_medium\_gdp}\)
- Low GDP per capita’s deaths does not equal High GDP per capita’s deaths
  - H0: \(\mu_{deaths\_low\_gdp}\) = \(\mu_{deaths\_high\_gdp}\)
  - H1: \(\mu_{deaths\_low\_gdp}\) != \(\mu_{deaths\_high\_gdp}\)
We will use a two-sample t-test for each pair, with an \(\alpha\) value of 0.05.
Test 1:
p-value (test 1): 2.99e-203
p-value (test 1) < \(\alpha = 0.05\): TRUE
Conclusion of test 1: the p-value is less than \(\alpha = 0.05\); therefore we reject our null hypothesis that \(\mu_{deaths\_lowest\_gdp}\) is equal to \(\mu_{deaths\_low\_gdp}\) in favor of our alternative hypothesis.
Test 2:
p-value (test 2): 1.47e-13
p-value (test 2) < \(\alpha = 0.05\): TRUE
Conclusion of test 2: the p-value is less than \(\alpha = 0.05\); therefore we reject our null hypothesis that \(\mu_{deaths\_low\_gdp}\) is equal to \(\mu_{deaths\_medium\_gdp}\) in favor of our alternative hypothesis.
Test 3:
p-value (test 3): 0e+00
p-value (test 3) < \(\alpha = 0.05\): TRUE
Conclusion of test 3: the p-value is less than \(\alpha = 0.05\); therefore we reject our null hypothesis that \(\mu_{deaths\_medium\_gdp}\) is equal to \(\mu_{deaths\_high\_gdp}\) in favor of our alternative hypothesis.
Test 4:
p-value (test 4): 2.91e-06
p-value (test 4) < \(\alpha = 0.05\): TRUE
Conclusion of test 4: the p-value is less than \(\alpha = 0.05\); therefore we reject our null hypothesis that \(\mu_{deaths\_lowest\_gdp}\) is equal to \(\mu_{deaths\_high\_gdp}\) in favor of our alternative hypothesis.
Test 5:
p-value (test 5): 4.79e-48
p-value (test 5) < \(\alpha = 0.05\): TRUE
Conclusion of test 5: the p-value is less than \(\alpha = 0.05\); therefore we reject our null hypothesis that \(\mu_{deaths\_lowest\_gdp}\) is equal to \(\mu_{deaths\_medium\_gdp}\) in favor of our alternative hypothesis.
Test 6:
p-value (test 6): 4.17e-70
p-value (test 6) < \(\alpha = 0.05\): TRUE
Conclusion of test 6: the p-value is less than \(\alpha = 0.05\); therefore we reject our null hypothesis that \(\mu_{deaths\_low\_gdp}\) is equal to \(\mu_{deaths\_high\_gdp}\) in favor of our alternative hypothesis.
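Running six tests at \(\alpha\) = 0.05 each raises the chance of at least one false positive, so adjusting for multiple comparisons is a common precaution. The sketch below uses `pairwise.t.test()` from base R's stats package with a Bonferroni adjustment. The synthetic data and quartile-based groups are assumptions, not the report's data.

```r
set.seed(11)
n <- 400

# Synthetic data and assumed quartile groups, for illustration only.
gdp <- exp(runif(n, 10, 18))
deaths <- exp(10 - 0.39 * log(gdp) + rnorm(n, sd = 0.3))
gdp_level <- cut(gdp, breaks = quantile(gdp, probs = seq(0, 1, by = 0.25)),
                 include.lowest = TRUE,
                 labels = c("lowest", "low", "medium", "high"))

# All pairwise two-sample t-tests with Bonferroni-adjusted p-values.
# pool.sd = FALSE runs a separate Welch-style test for each pair.
pairwise_res <- pairwise.t.test(deaths, gdp_level,
                                p.adjust.method = "bonferroni",
                                pool.sd = FALSE)

# The p-value matrix is lower-triangular; count the comparisons actually made.
n_comparisons <- sum(!is.na(pairwise_res$p.value))

pairwise_res$p.value
```

With p-values as small as those reported above, a Bonferroni adjustment (multiplying each by 6) would not change any of the six conclusions.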
6. Conclusion
Main Research Results
From all of our tests, we can confirm that the mean deaths caused by air pollution differ significantly across the different levels of GDP per capita. This reinforces the idea that deaths caused by air pollution have a significant relationship with GDP per capita, and the relationship can be quantified by Equation 2, where \(\ln\) is the natural logarithm:
\[ Deaths_{from~air~pollution~per~100k|per~year|per~country} = e^{10.07849 - 0.38952 \cdot \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]
The strength of the correlation can be quantified by our R2 value of 0.737 from Figure XX.
Areas of Further Analysis
This data set has many avenues for further statistical analysis and modeling. Some potential areas for further analysis include:
- Can we quantify, or build a mathematical model of, the GDP per capita value at which deaths per 100k level off (the asymptotic behavior we observed in the log-transformed data)?
- Can we build a better performing predictor for Deaths per 100k due to Air Pollution using more powerful models (Random Forests, Gradient Boosting, SVMs) and/or by including more features?